Urban Greenspace and Asthma Prevalence in Columbus, Ohio in 2023

Max Warnock
2/3/26

Introduction

There are many studies pointing to the value of urban green space in promoting human physical and mental health. Benefits of urban green space include offsetting greenhouse gas emissions, reducing urban heat island effects, collecting storm water, and improving air quality. Urban green spaces can also provide increased opportunities for outdoor physical activity and social interaction, which can improve mental health (Lee et al., 2015).

For this project, I will be studying the relationship between urban green space and asthma rates in Columbus, Ohio. According to the Ohio Department of Health, Columbus has some of the highest asthma rates in the Midwest. Vegetation metrics that quantify the structure and connectivity of greenspace have been correlated with improved human health, as shown in studies by Tsai et al. (2019) and Liu et al. (2021). To quantify green space, I will be analyzing the vegetation metrics of mean patch size, edge density, and fragmentation.

In the notebook below, I will use Python code to calculate patch, edge, and fragmentation statistics for urban greenspace in Columbus, OH. These statistics should reflect the connectivity and spread of urban greenspace, which are important for ecosystem function and access. I will then use a linear model to identify statistically significant correlations between the distribution of greenspace and health data compiled by the U.S. Centers for Disease Control and Prevention.

Data

Centers for Disease Control and Prevention (CDC) PLACES Data.

I will be using this data source for asthma data by census tract in Columbus, OH in 2023. This dataset also provides modified census tracts clipped to the city boundary, which will allow me to map the data more easily.

National Agriculture Imagery Program (NAIP) Data.

For vegetation data, I will be using NAIP imagery, which has a very high resolution of 1m. This resolution will allow me to analyze the structure of the greenspace, rather than simply quantify it, which could be done with lower-resolution Landsat (30m) or MODIS (250m) data. Since this is a very large dataset, I will be using a cloud-based solution (Microsoft Planetary Computer) to handle the data download.

Step 1. Setting up the analysis

In [ ]:
# Install sodapy if you haven't already
# !pip install sodapy
In [1]:
### Import libraries


### File paths and organization
import os 
import pathlib


### See Warnings
import warnings


### Different data types
import pandas as pd
import geopandas as gpd
import numpy as np 
import xarray as xr
import rioxarray as rxr
import rioxarray.merge as rxrmerge
import shapely
import time


### plotting and visualization packages
import geoviews as gv
import holoviews as hv
import hvplot.pandas
import hvplot.xarray
from cartopy import crs as ccrs


### data exploration and visualization
import matplotlib
import scipy.stats as stats
import matplotlib.pyplot as plt
from bokeh.models import FixedTicker

### image processing
from scipy.ndimage import convolve
from scipy.ndimage import label


### working with data in the cloud
import pystac_client
from sodapy import Socrata


### progress bar
from tqdm.notebook import tqdm


### OLS sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

Create reproducible file paths within your home directory

In [2]:
### Create reproducible file paths
data_dir = os.path.join(
    
    ### home directory
    pathlib.Path.home(),
    
    ### Earth analytics folder
    "earth-analytics","data",
    
    ### Project directory
    "columbus-asthma",
)

### make the directory
os.makedirs(data_dir,exist_ok=True)

Prevent code disruptions from momentary connection lapses. This is very important for the long process of downloading NAIP data.

In [3]:
### Prevent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

Step 2. Create a Site Map

Create a quick visualization of the city of Columbus census tracts (from the CDC PLACES dataset) overlaid on Esri satellite imagery. Since there is more than one city named Columbus in the U.S., I subset by place name and by Ohio's state FIPS code (39).

In [44]:
### Set up the census tract path

### Define directory
tract_dir = os.path.join(data_dir,"columbus-tract")
os.makedirs(tract_dir,exist_ok=True)

### make path to directory
tract_path = os.path.join(tract_dir,"columbus_tract.shp")

### Load in census tract data
if not os.path.exists(tract_path):

    ### put in census url
    tract_url = ("https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip")

    ### read the shape file
    tract_gdf = gpd.read_file(tract_url)

    ### subset to Columbus Ohio
    columbus_tract_gdf = tract_gdf[(tract_gdf["PlaceName"] == "Columbus city") & (tract_gdf["ST"] == "39")]
    
    ### save it as a file
    columbus_tract_gdf.to_file(tract_path,index=False)

### Load in the census tract data
columbus_tract_gdf = gpd.read_file(tract_path)
columbus_tract_gdf

### Site plot -- Census tracts with satellite imagery in the background
(
    columbus_tract_gdf

    .to_crs(ccrs.Mercator())

    ### plot with satellite imagery in the background
    .hvplot(
        line_color = 'orange', fill_color = None,
        crs = ccrs.Mercator(), tiles = 'EsriImagery',
        frame_width = 600, title = "Columbus, OH Census Tracts"
    )
)
Out[44]:

Figure 1: From the visualization above, it's clear that Columbus has a mix of highly developed areas, green areas, and some rivers and waterways. Additionally, the surrounding area appears to be farmland.

Step 3. Access Asthma and Urban Greenspace Data

Step 3a. In the step below, I will download the asthma data from the CDC PLACES dataset for 2023.

In [ ]:
### Set up a path for the asthma data
cdc_path = os.path.join(data_dir, 'asthma.csv')

### Download asthma data (only once)
if not os.path.exists(cdc_path):

    cdc_url = (
        
        ### url for data
        "https://data.cdc.gov/resource/cwsq-ngmh.csv"

        ### parameters to filter on (Franklin is the county for Columbus)

        "?year=2023"
        "&stateabbr=OH"
        "&countyname=Franklin"
        "&measureid=CASTHMA"
        "&$limit=1500"

    )

    cdc_df = (

        ### open it as a csv
        pd.read_csv(cdc_url)

        ### rename the cols
        .rename(columns = {
            'data_value': 'asthma',
            'low_confidence_limit': 'asthma_ci_low',
            'high_confidence_limit': 'asthma_ci_high',
            'locationname': 'tract'})

            ### select columns I want to keep
            [[
                'year', 'tract',
                'asthma', 'asthma_ci_low', 'asthma_ci_high',
                'data_value_unit', 'totalpopulation',
                'totalpop18plus'
            ]]
    )
else:
    cdc_df = pd.read_csv(cdc_path)

### Save asthma data to csv
cdc_df.to_csv(cdc_path, index = False)

### Preview asthma data
cdc_df
Out[ ]:
year tract asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus
0 2023 39049009397 12.8 11.4 14.1 % 2977 2418
1 2023 39049006237 9.7 8.7 10.8 % 5953 5051
2 2023 39049007114 12.0 10.8 13.3 % 5130 3943
3 2023 39049010204 12.4 11.2 13.8 % 6684 4922
4 2023 39049007961 10.0 8.9 11.1 % 4974 3401
... ... ... ... ... ... ... ... ...
322 2023 39049006392 9.4 8.5 10.5 % 4644 3485
323 2023 39049006302 9.9 8.9 11.0 % 4613 3674
324 2023 39049000600 11.1 9.9 12.5 % 3839 3492
325 2023 39049006923 11.5 10.3 12.8 % 3957 2866
326 2023 39049009373 13.1 11.8 14.5 % 6184 4241

327 rows × 8 columns

Step 3b. Join CDC asthma data with census tract boundaries and plot the result.

In [52]:
### Change tract identifier datatype for merging
columbus_tract_gdf.tract2010 = columbus_tract_gdf.tract2010.astype('int64')

### Merge census data with geometry
tract_cdc_gdf = (
    columbus_tract_gdf
    .merge(cdc_df,
           
           ### specify cols to merge on
           left_on = 'tract2010',
           right_on = 'tract',

           ### use inner join

           how = 'inner'           
           )
)

# tract_cdc_gdf

# Numeric ticks to keep on the colorbar
numeric_ticks = [10, 11, 12, 13, 14, 15]

# Ticks whose labels will be overridden with text markers
name_ticks = [10, 15]

# Combine them without duplicates
all_ticks = sorted(set(numeric_ticks + name_ticks))

(
    ### load basemap with satellite imagery
    gv.tile_sources.EsriImagery

    *

    ### map census tracts with asthma data
    gv.Polygons(

        ### reproject to align CRS
        tract_cdc_gdf.to_crs(ccrs.Mercator()),

        ### select cols
        vdims=['asthma', 'tract2010'],

        ### ensure CRSs align
        crs=ccrs.Mercator()

    ).opts(
        color='asthma',
        colorbar=True,
        tools=['hover'],
        title="Columbus, OH Asthma Rates in 2023",
        clabel='Asthma rate',
        colorbar_opts={
            'ticker': FixedTicker(ticks=all_ticks),
            'major_label_overrides': {
                10: 'Lower asthma rate',
                15: 'Higher asthma rate'
            }
        }
    )
).opts(width=600, height=600, xaxis=None, yaxis=None)
Out[52]:

Figure 2: This map shows the CDC PLACES data for asthma rates in Columbus. Darker blues represent higher asthma rates than lighter colors. Unfortunately, it appears that some census tracts do not have asthma data; they are represented as clear on the map. However, there is still data for most of the city, so I think it's okay to proceed with the analysis. There are some interesting spatial trends. For example, the northwest side of the city appears to have significantly lower asthma rates than the central and southern parts of the city. The map also generally agrees with my previous research that Columbus has high rates of asthma overall (Ohio Department of Health).

Step 3c. Get URLs for urban greenspace imagery.

Below, I'm connecting to the Microsoft Planetary Computer SpatioTemporal Asset Catalog (STAC).

In [7]:
### Connect to the planetary computer catalog
e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
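
Note: the connection above does not sign asset URLs, and some Planetary Computer collections require signed URLs before their assets can be opened. If downloads fail with authorization errors, here is a minimal sketch of one fix, assuming the optional planetary-computer package is installed (it is not used elsewhere in this notebook):

In [ ]:
### Optional: sign asset URLs automatically
### (assumes: pip install planetary-computer)
import planetary_computer

e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",

    ### sign each search result's asset hrefs in place
    modifier=planetary_computer.sign_inplace,
)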

The loop below collects URLs of NAIP imagery for each tract that we are studying in Columbus. It will not collect data for census tracts that we do not have asthma data for.

This loop goes through every tract value in tract_latlong_gdf. For each tract, the loop extracts the tract ID, checks whether the tract has already been downloaded, searches the STAC for NAIP imagery, builds a dataframe with the tract ID, date, and URL, and retries in case of internet disruptions. After the loop has finished, the dataframes are concatenated into a complete object (scene_df) that contains all the tracts, dates, and URLs.

In [53]:
### Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)

### Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, "columbus-ndvi-stats.csv")

### Check for existing data - do not access duplicate tracts
### initialize an empty list for the census tract IDs
downloaded_tracts = []

### write the conditional
if os.path.exists(ndvi_stats_path):

    ### if it exists, open the csv at the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)

    ### fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values

### if it doesn't exist, print statement
else:
    print("No census tracts downloaded so far")

### Loop through each census tract

### list to accumulate into 
scene_dfs = []

### loop through each tract value in the gdf
for _, tract_values in tqdm(tract_latlong_gdf.iterrows()):

    ### extract the tract id
    tract = tract_values.tract2010

    ### Check if statistics are already downloaded for this tract
    if not (tract in downloaded_tracts):


        ### Repeat up to 5 times in case of a momentary disruption 
        attempt = 0 ### initialize retry counter
        retry_limit = 5 ### max number of tries
        while attempt < retry_limit:

            ### Try accessing the STAC
            try:

                ### Search for tiles 
                naip_search = e84_catalog.search(
                    collections = ['naip'],
                    intersects = shapely.to_geojson(tract_values.geometry),
                    datetime="2021"
                )

                ### Build dataframe with tracts and tile urls
                scene_dfs.append(pd.DataFrame(dict(

                    ### column called tract
                    tract = tract,

                    date = [pd.to_datetime(scene.datetime).date()
                            
                            for scene in naip_search.items()],

                    rgbir_href = [scene.assets['image'].href for scene in naip_search.items()],
                
                
                )))

                ### exit retry loop once STAC request succeeds
                break

            ### Try again in case of an APIError
            except pystac_client.exceptions.APIError:
                print(
                    f'Could not connect with STAC server. '
                    f'Retrying tract {tract}...')
                
                ### wait 2 seconds before retrying
                time.sleep(2)
                attempt += 1
                continue
                

### Concatenate the url dataframes
if scene_dfs:

    ### concatenate
    scene_df = pd.concat(scene_dfs).reset_index(drop = True)

### otherwise
else:
    scene_df = None

### Preview the url dataframe
scene_df
In [ ]:
# Save the url dataframe to a csv (skip if nothing new was found)
if scene_df is not None:
    scene_df.to_csv(ndvi_stats_path, index=False)

Step 3d. Compute NDVI Statistics

Here are the three vegetation statistics that we will be calculating (a toy demonstration follows the list):

  • Percentage vegetation: The percentage of pixels that exceed a vegetation threshold (0.12 is a common threshold with Landsat)

  • Mean patch size: The average size of patches, or contiguous areas exceeding the vegetation threshold. Patches can be identified with the label function from scipy.ndimage

  • Edge density: The proportion of pixels that lie on the edge of a vegetation patch. Edges can be identified by convolving the image with a kernel designed to detect pixels that are different from their surroundings.
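
To make the first two statistics concrete, here is a minimal sketch on a tiny synthetic array; the 5x5 values and the 0.3 threshold are illustrative assumptions, not real NAIP data. The edge-detection kernel is demonstrated separately below.

In [ ]:
### Toy demonstration of percentage vegetation and mean patch size
### (synthetic 5x5 "NDVI" values, illustrative only)
import numpy as np
from scipy.ndimage import label

ndvi_demo = np.array([
    [0.8, 0.7, 0.1, 0.0, 0.0],
    [0.6, 0.5, 0.0, 0.0, 0.4],
    [0.1, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0, 0.0]])

### vegetation mask: pixels above the threshold
veg_demo = ndvi_demo > 0.3

### percentage vegetation: vegetated pixels / total pixels (7/25 = 0.28)
print("frac veg:", veg_demo.sum() / veg_demo.size)

### mean patch size: label contiguous patches, then average their pixel counts
### (three patches of 4, 2, and 1 pixels, so the mean is 7/3)
labeled, n_patches = label(veg_demo)
patch_sizes = np.bincount(labeled.ravel())[1:]
print("patches:", n_patches, "mean patch size:", patch_sizes.mean())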

In [19]:
### make csv for ndvi data
ndvi_stats_path_veg = os.path.join(data_dir, 'columbus-ndvi-stats-veg.csv')

Loop to compute NDVI statistics

An important goal of this loop is to compute NDVI statistics per census tract. Therefore, we need to clip the NAIP imagery to each tract. The loop below goes through each census tract of interest and clips NAIP imagery to the geometry of that census tract. Then, we compute NDVI statistics for the clipped NAIP imagery.
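
For reference, NDVI is computed from the near-infrared and red bands (bands 4 and 1, respectively, in NAIP imagery):

$$ \text{NDVI} = \frac{\text{NIR} - \text{Red}}{\text{NIR} + \text{Red}} $$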

Mean Patch Size: From the NDVI values, we create a vegetation mask by selecting all pixels above a threshold, in this case 0.3 NDVI. We then use the 'label' function from scipy.ndimage to identify contiguous patches of pixels above the threshold, and average the patch pixel counts to get the mean patch size.

Edge Density: To identify edges, we apply a kernel to the vegetation mask to determine which pixels lie on a boundary between vegetated and non-vegetated areas. The kernel is applied to every pixel in the tract (this is sometimes referred to as a moving window or focal statistics). Because we are using a kernel, the outermost pixels of the tract are affected by the padding at the array boundary.

Below is the 3x3 kernel that I will be using. For each pixel, the code multiplies the surrounding 3x3 neighborhood by the kernel and sums the result.

$$ \text{Kernel} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$

Kernel results: If the center pixel is the same as the surrounding pixels, its value in the final matrix will be $-8 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 0$. If it is higher than the surroundings, the result will be negative, and if it is lower than the surroundings, the result will be positive. As such, the edge pixels of our patches will be negative.

Kernel and explanation of how it works credit: CU Boulder Earth Lab
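
As a quick sanity check of that arithmetic, here is a minimal sketch applying the kernel to a single synthetic patch of vegetation (illustrative data only):

In [ ]:
### Sanity check: apply the edge kernel to one synthetic 3x3 patch
import numpy as np
from scipy.ndimage import convolve

### 5x5 mask with a 3x3 block of "vegetation" in the middle
patch = np.zeros((5, 5), dtype=int)
patch[1:4, 1:4] = 1

kernel = np.array([
    [1, 1, 1],
    [1, -8, 1],
    [1, 1, 1]])

### interior vegetation pixel: -8 + 8 = 0
### vegetation pixels on the patch boundary: negative values
### non-vegetated pixels next to the patch: positive values
print(convolve(patch, kernel, mode='constant'))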

Percent Vegetation: We can calculate percent vegetation simply by dividing the number of vegetation pixels (veg_mask) by the total pixels in the tract.

This loop runs through all these steps to calculate mean patch size, edge density, and percent vegetation. It collects all these calculated statistics in a dataframe which can be saved as a csv file. This is a long process and the loop may take several hours.

In [ ]:
### Skip this step if data are already downloaded 
if scene_df is not None:

    ### Loop through the census tracts with URLs
    for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')):

        ### Open all images for tract
        ### create a list to store NDVI arrays, with 1 array per NAIP image
        tile_das = []
        
        ### iterate over rows  in tract specific dataframe
        for _, href_s in tract_date_gdf.iterrows():

            ### Open vsi connection to data
            tile_da = rxr.open_rasterio(

                ### mask and squeeze
                href_s.rgbir_href, masked = True).squeeze()        
            
            ### Clip data
            boundary = (
                tract_cdc_gdf
                .set_index('tract2010')
                .loc[[tract]]
                .to_crs(tile_da.rio.crs)
                .geometry
            )

            ### Crop to bounding box
            crop_da = tile_da.rio.clip_box(
                *boundary.envelope.total_bounds,
                auto_expand = True
            )

            ### clip to actual tract geometry
            clip_da = crop_da.rio.clip(boundary, all_touched = True)
                
            ### Compute NDVI
            ndvi_da = (
            (clip_da.sel(band = 4) - clip_da.sel(band = 1))
            / (clip_da.sel(band = 4) + clip_da.sel(band = 1)))

            ### Accumulate result
            tile_das.append(ndvi_da)

        ### Merge data
        scene_da = rxrmerge.merge_arrays(tile_das)

        ### Mask vegetation
        veg_mask = (scene_da > .3)

        ### Calculate statistics and save data to file

        ### count all valid pixels in the tract
        total_pixels = scene_da.notnull().sum()

        ### count all vegetated pixels
        veg_pixels = veg_mask.sum()

        ### identify vegetation patches
        labeled_patches, num_patches = label(veg_mask)

        ### count patch pixels
        patch_sizes = np.bincount(labeled_patches.ravel())[1:]

        ### mean patch size
        mean_patch_size = patch_sizes.mean()
        

        ### Calculate edge density


        ### define the kernel
        kernel = np.array([
            [1, 1, 1], 
            [1, -8, 1], 
            [1, 1, 1]])
        
        ### detect boundaries between veg and non-veg
        ### (cast the mask to int so negative kernel responses are preserved)
        edges = convolve(veg_mask.astype(int), kernel, mode='constant')

        ### fraction of all pixels flagged as edge by the kernel
        edge_density = np.sum(edges != 0) / veg_mask.size

        ### Add a row to the statistics file for  this tract
        ### create new dataframe for the tract and append it to the csv
        pd.DataFrame(dict(

            ### store tract ID
            tract = [tract],

            ### store total pixels
            total_pixels = [int(total_pixels)],

            ### fraction of veg pixels
            frac_veg = [float(veg_pixels / total_pixels)],

            ### store the mean patch size
            mean_patch_size = [mean_patch_size],

            ### edge density 
            edge_density = [edge_density]

        ### write out as a csv 
        )).to_csv(
            ndvi_stats_path_veg,

        ### append mode
        mode = 'a',

        ### get rid of row numbers
        index = False, 

        ### write column names 
        header = (not os.path.exists(ndvi_stats_path_veg))
        )
In [21]:
### check the file
ndvi_stats_df = pd.read_csv(ndvi_stats_path_veg)
ndvi_stats_df
Out[21]:
tract total_pixels frac_veg mean_patch_size edge_density
0 39049000110 5458912 0.467575 148.579661 0.189672
1 39049000120 8008243 0.516840 163.790344 0.125395
2 39049000210 4766663 0.532251 206.634631 0.179721
3 39049000220 5410607 0.385892 134.296520 0.114323
4 39049000310 6862911 0.210889 87.108697 0.157271
... ... ... ... ... ...
199 39049010100 17816122 0.260141 80.683443 0.144937
200 39049010300 6793594 0.062171 31.383935 0.043643
201 39049010601 2797819 0.475241 243.612862 0.107467
202 39049010602 2386501 0.497720 211.128688 0.093800
203 39049010700 4169917 0.544232 239.894609 0.169301

204 rows × 5 columns

Step 4. Explore the data with plots

First, we need to merge all our data together into one geodataframe that we can use for plotting. See below.

In [22]:
### Merge census data with geometry
columbus_ndvi_cdc_gdf = (

    ### grab census tract gdf
    tract_cdc_gdf

    ### merge with ndvi df
    .merge(
        ndvi_stats_df,
        ### match on the tract ID
        ### left: tract2010
        ### right: tract
        left_on = 'tract2010',
        right_on = 'tract',
        how = 'inner'
    )
)

### check it out
columbus_ndvi_cdc_gdf
Out[22]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract_x asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus tract_y total_pixels frac_veg mean_patch_size edge_density
0 3918000 39049000110 39 Columbus 3918000-39049000110 3344 POLYGON ((-9241687.822 4875035.426, -9241688.4... 2023 39049000110 10.3 9.1 11.4 % 3489 3015 39049000110 5458912 0.467575 148.579661 0.189672
1 3918000 39049000120 39 Columbus 3918000-39049000120 3162 POLYGON ((-9239407.998 4873815.462, -9239390.7... 2023 39049000120 10.0 8.9 11.1 % 3220 2716 39049000120 8008243 0.516840 163.790344 0.125395
2 3918000 39049000210 39 Columbus 3918000-39049000210 2935 POLYGON ((-9239360.353 4872960.063, -9239346.3... 2023 39049000210 10.2 9.1 11.4 % 3049 2512 39049000210 4766663 0.532251 206.634631 0.179721
3 3918000 39049000220 39 Columbus 3918000-39049000220 3727 POLYGON ((-9239346.327 4872707.908, -9239330.8... 2023 39049000220 10.0 8.9 11.1 % 4186 3398 39049000220 5410607 0.385892 134.296520 0.114323
4 3918000 39049000310 39 Columbus 3918000-39049000310 3119 POLYGON ((-9239195.377 4872909.315, -9239210.7... 2023 39049000310 11.6 10.4 13.0 % 3377 2650 39049000310 6862911 0.210889 87.108697 0.157271
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
199 3918000 39049007922 39 Columbus 3918000-39049007922 0 POLYGON ((-9254031.153 4874089.169, -9254045.3... 2023 39049007922 10.3 9.2 11.5 % 6201 4821 39049007922 46829 0.352816 58.381625 0.038817
200 3918000 39049007931 39 Columbus 3918000-39049007931 0 MULTIPOLYGON (((-9252946.678 4868317.011, -925... 2023 39049007931 10.5 9.3 11.7 % 4305 3181 39049007931 3097110 0.028144 142.893443 0.012675
201 3918000 39049008230 39 Columbus 3918000-39049008230 0 POLYGON ((-9252589.564 4857148.339, -9252620.5... 2023 39049008230 14.3 12.9 15.9 % 3178 1792 39049008230 138983 0.105373 143.578431 0.066807
202 3918000 39049008500 39 Columbus 3918000-39049008500 0 POLYGON ((-9241942.521 4861874.191, -9241967.7... 2023 39049008500 10.1 9.0 11.4 % 5702 4784 39049008500 2643186 0.119766 34.042908 0.061626
203 3918000 39049009386 39 Columbus 3918000-39049009386 0 MULTIPOLYGON (((-9218532.58 4856069.647, -9218... 2023 39049009386 13.4 12.1 14.8 % 3279 2263 39049009386 2486908 0.011225 20.346939 0.014952

204 rows × 20 columns

Next, we will make a function for plotting maps as choropleths. Since we are plotting several maps, using a function is more efficient.

In [23]:
### Plot choropleths with vegetation statistics

### make a plot_chloropleth function

def plot_chloropleth(gdf, **opts):
    ### docstring to describe the function
    """Generate a choropleth with the given color column"""

    ### use geoviews to make a polygon object
    return gv.Polygons(

        ### use mercator
        gdf.to_crs(ccrs.Mercator()),

        ### set plot crs to mercator
        crs = ccrs.Mercator()

    ### drop x and y axis and add a legend
    ).opts(xaxis = None, yaxis = None, colorbar = True, **opts)

Apply the function to Columbus asthma and vegetation data

In [59]:
### apply the function to Columbus asthma and vegetation data
(
    plot_chloropleth(

        ### plot asthma by census tract
        columbus_ndvi_cdc_gdf, 
        color = 'asthma', 
        cmap = 'viridis', 
        title = 'Asthma Rates in 2023',
        clabel='Low asthma rate                  High asthma rate',)
        +

        ### add a second plot with vegetation edge density 
        plot_chloropleth(
            columbus_ndvi_cdc_gdf, 
            color = 'edge_density', 
            cmap = 'Greens', 
            title = 'Vegetation Edge Density in 2021',
            clabel='Low edge den                 High edge den')
        +

        ### add a third plot with fraction of vegetation per tract 
        plot_chloropleth(
            columbus_ndvi_cdc_gdf, 
            color = 'frac_veg', 
            cmap = 'Greens', 
            title = 'Percent Vegetation in 2021',
            clabel='Low % vegetation                  High % vegetation')
        +

        ### add a fourth plot with vegetation mean patch size
        plot_chloropleth(
            columbus_ndvi_cdc_gdf, 
            color = 'mean_patch_size', 
            cmap = 'Greens', 
            title = 'Mean Vegetation Patch Size (MPS) in 2021',
            clabel='Low MPS                          High MPS')
).cols(2) 
### .cols(2) above arranges the maps in a 2x2 grid
Out[59]:

Figure 3: Choropleth maps showing asthma rates as well as the other vegetation metrics that we calculated above.

Step 5. Exploring a linear ordinary least-squares regression

To identify whether there is a statistically significant relationship between asthma prevalence and greenspace metrics, we can run a linear ordinary least squares (OLS) regression and measure how well it is able to predict asthma from the given vegetation metrics.

OLS has 5 main assumptions:

  1. The relationship between x and y is linear.
  2. Errors are normally distributed.
  3. Independence: When there are multiple x variables, collinearity needs to be avoided so that an effect is not attributed to the wrong variable. (This applies here, since we use three greenspace predictors, some of which may be correlated.)
  4. Stationarity: The parameters of the model do not vary over time.
  5. Complete observations: Can't have "No Data" values.

Model Objective:
The objective of this model is to determine if there is a statistically significant relationship between asthma rates and greenspace. For this assignment, we are comparing asthma rates to three different vegetation metrics: mean patch size, edge density (fragmentation), and percent vegetation (frac_veg). We need to plot the data for each of these metrics to see if they fit the assumptions of OLS. In particular, we want to look for a linear relationship between x and y, and a normal distribution.

Advantages and potential problems:
If greenspace and asthma rates are in fact correlated in some way, this is a good model for finding that correlation. Potential problems with using this model are that these are big datasets, and there may be issues like uneven distributions. Additionally, there are likely other variables besides greenspace, such as other environmental factors, socioeconomic status, demographics, and access to healthcare, that could affect the data. The OLS model also does not account for spatial autocorrelation.

In [60]:
### Visualize distribution of data

### select variables
variables = ['frac_veg','asthma','mean_patch_size','edge_density']

hvplot.scatter_matrix(

    columbus_ndvi_cdc_gdf
    [variables]
)
Out[60]:

Figure 4: Histograms and scatter plots comparing all variables with each other in preparation for fitting an OLS model. The histograms show fairly normal distributions for both edge density and percent vegetation (frac_veg). However, asthma and mean patch size are somewhat skewed, so they are candidates for log transformation. Additionally, in the scatter plots, some of the highest levels of correlation appear between edge density and frac_veg.
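
To put a number on that visual impression, the pairwise Pearson correlations can be checked directly; this is a quick added check, not part of the original workflow:

In [ ]:
### Quantify the pairwise correlations seen in the scatter matrix
variables = ['frac_veg', 'asthma', 'mean_patch_size', 'edge_density']

### Pearson correlation matrix for the selected columns
columbus_ndvi_cdc_gdf[variables].corr().round(2)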

In [61]:
### histograms (another way of making them)

### make facet
fig, axes = plt.subplots(2, 2, figsize = (12, 10))

### list variables to plot
variables = ['frac_veg','asthma','mean_patch_size','edge_density']
titles = ['Vegetated fraction','Adult asthma rate','Mean patch size','Edge density']

### loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
    ax = axes[i//2, i%2]
    ax.hist(columbus_ndvi_cdc_gdf[var], bins = 10, color = 'gray', edgecolor = 'black')
    ax.set_title(title)
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')

### adjust layout to prevent overlap
plt.tight_layout()
plt.show()

Figure 5: Demonstration of another way to plot the variables, simply as histograms (same results).

Drop missing values as these could skew the model.

In [ ]:
### Drop missing observations 
model_df = (
    columbus_ndvi_cdc_gdf

    ### make a copy
    .copy()

    ### subset to columns
    [['frac_veg','asthma','mean_patch_size','edge_density','geometry']]

    ### drop rows (observations) with missing values
    .dropna()
)

model_df
Out[ ]:
frac_veg asthma mean_patch_size edge_density geometry
0 0.467575 10.3 148.579661 0.189672 POLYGON ((-9241687.822 4875035.426, -9241688.4...
1 0.516840 10.0 163.790344 0.125395 POLYGON ((-9239407.998 4873815.462, -9239390.7...
2 0.532251 10.2 206.634631 0.179721 POLYGON ((-9239360.353 4872960.063, -9239346.3...
3 0.385892 10.0 134.296520 0.114323 POLYGON ((-9239346.327 4872707.908, -9239330.8...
4 0.210889 11.6 87.108697 0.157271 POLYGON ((-9239195.377 4872909.315, -9239210.7...
... ... ... ... ... ...
199 0.352816 10.3 58.381625 0.038817 POLYGON ((-9254031.153 4874089.169, -9254045.3...
200 0.028144 10.5 142.893443 0.012675 MULTIPOLYGON (((-9252946.678 4868317.011, -925...
201 0.105373 14.3 143.578431 0.066807 POLYGON ((-9252589.564 4857148.339, -9252620.5...
202 0.119766 10.1 34.042908 0.061626 POLYGON ((-9241942.521 4861874.191, -9241967.7...
203 0.011225 13.4 20.346939 0.014952 MULTIPOLYGON (((-9218532.58 4856069.647, -9218...

204 rows × 5 columns

Below, I perform a log transformation on the asthma and mean patch size variables, since both were somewhat skewed and a more normal distribution is better suited to an OLS model.

In [37]:
### Perform variable transformation

### log of asthma rate (to try to make it more normally distributed)
model_df['log_asthma'] = np.log(model_df.asthma)

### log of patch size
model_df['log_patch'] = np.log(model_df.mean_patch_size)

model_df
Out[37]:
frac_veg asthma mean_patch_size edge_density geometry log_asthma log_edge_den log_patch
0 0.467575 10.3 148.579661 0.189672 POLYGON ((-9241687.822 4875035.426, -9241688.4... 2.332144 -1.662459 5.001121
1 0.516840 10.0 163.790344 0.125395 POLYGON ((-9239407.998 4873815.462, -9239390.7... 2.302585 -2.076290 5.098587
2 0.532251 10.2 206.634631 0.179721 POLYGON ((-9239360.353 4872960.063, -9239346.3... 2.322388 -1.716350 5.330952
3 0.385892 10.0 134.296520 0.114323 POLYGON ((-9239346.327 4872707.908, -9239330.8... 2.302585 -2.168725 4.900050
4 0.210889 11.6 87.108697 0.157271 POLYGON ((-9239195.377 4872909.315, -9239210.7... 2.451005 -1.849786 4.467157
... ... ... ... ... ... ... ... ...
199 0.352816 10.3 58.381625 0.038817 POLYGON ((-9254031.153 4874089.169, -9254045.3... 2.332144 -3.248902 4.067001
200 0.028144 10.5 142.893443 0.012675 MULTIPOLYGON (((-9252946.678 4868317.011, -925... 2.351375 -4.368087 4.962099
201 0.105373 14.3 143.578431 0.066807 POLYGON ((-9252589.564 4857148.339, -9252620.5... 2.660260 -2.705946 4.966881
202 0.119766 10.1 34.042908 0.061626 POLYGON ((-9241942.521 4861874.191, -9241967.7... 2.312535 -2.786673 3.527622
203 0.011225 13.4 20.346939 0.014952 MULTIPOLYGON (((-9218532.58 4856069.647, -9218... 2.595255 -4.202937 3.012930

204 rows × 8 columns

In [38]:
### Visualize transformed variables
hvplot.scatter_matrix(
    model_df
    [[
        'frac_veg',
        'edge_density',
        'log_patch',
        'log_asthma'
    ]]
)
Out[38]:

Figure 6: Results after performing the log transformation on the asthma and mean patch size variables. The variables are all much more normally distributed on the histograms after this transformation.

To verify our visual interpretation of the histograms above, we can use Q-Q plots, which make departures from normality easier to see.

In [40]:
### q-q plots

### set variables
var_qq = ['frac_veg', 'edge_density', 'log_patch', 'log_asthma']

### make a q-q plot for each variable
plt.figure(figsize = (12, 10))
for i, var in enumerate(var_qq, 1):

    ### make 2x2 facet
    plt.subplot(2, 2, i)

    ### norm distribution
    stats.probplot(model_df[var], dist = "norm", plot = plt)

    ### add title
    plt.title(f'Q-Q plot for {var}')

### plot it
plt.tight_layout()
plt.show()

Figure 7: Q-Q plots showing the distribution of each variable of interest. The data fall very close to the red line (normal distribution) in all of the plots above, meaning each variable is approximately normally distributed.

Step 5b. The data are now prepared as much as possible for running an OLS model. Next, I will select the predictor and outcome variables (vegetation metrics predicting asthma), and split the data into training and testing groups using the train_test_split function from sklearn, holding out 33% of the data for testing. I will then fit sklearn's LinearRegression on the training data, use it to predict asthma values for the test dataset, and compare the predicted values to the actual asthma values.

In [62]:
### Select predictor and outcome variables
X = model_df[['frac_veg','log_patch','edge_density']]
y = model_df[['log_asthma']]

### Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size = 0.33, random_state = 19
)

### Fit a linear regression
reg = LinearRegression()

### fit it with our data
reg.fit(X_train, y_train)

### Predict asthma values for the test dataset
y_test['pred_asthma'] = np.exp(reg.predict(X_test))

### real asthma rate for comparison
y_test['asthma'] = np.exp(y_test.log_asthma)

### check it out
y_test
Out[62]:
log_asthma pred_asthma asthma
62 2.525729 12.478952 12.5
129 2.564949 12.090072 13.0
152 2.406945 11.081863 11.1
94 2.442347 11.657036 11.5
188 2.332144 11.807511 10.3
... ... ... ...
61 2.322388 11.906692 10.2
144 2.360854 10.888511 10.6
201 2.660260 10.764413 14.3
131 2.468100 11.986769 11.8
36 2.517696 11.780745 12.4

68 rows × 3 columns

Visualizing the results below.

In [42]:
### Plot measured vs. predicted asthma prevalence with a 1-to-1 line
y_max = y_test.asthma.max()

(
    y_test
    .hvplot.scatter(
        x='asthma', 
        y='pred_asthma',
        xlabel = "Measured adult asthma prevalence",
        ylabel = "Predicted adult asthma prevalence",
        title = "Linear regression performance - testing data")
    .opts(aspect='equal', 
          xlim=(0, y_max), ylim=(0, y_max), 
          width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
Out[42]:

Figure 8: This plot shows how well the model predicted the actual measured asthma values. A few data points fall directly on the line, which indicates an almost perfect prediction. However, the majority of the points fall in a cluster on either side of the line, meaning the model both over- and under-predicted asthma values.
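
The notebook does not report a numeric score for this fit; as a hedged addition, here is one way it could be summarized with sklearn, using the y_test columns created above:

In [ ]:
### Summarize test-set performance numerically (an added check)
from sklearn.metrics import mean_squared_error, r2_score

### R^2 and RMSE on the back-transformed (percent) scale
r2 = r2_score(y_test.asthma, y_test.pred_asthma)
rmse = np.sqrt(mean_squared_error(y_test.asthma, y_test.pred_asthma))
print(f"R^2: {r2:.3f}, RMSE: {rmse:.2f} percentage points")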

Step 5c. To determine if there are any spatial patterns arising from the OLS model, we can plot the results spatially on the census tract map. To do this, we can use the 'model_df' dataframe, subtract the actual asthma value from the predicted value for each census tract, and plot the result using a choropleth map.

In [68]:
### Compute model error for all census tracts

### use the trained model to predict asthma prevalence for each census tract
model_df['pred_asthma'] = np.exp(reg.predict(X))

### calculate model error
model_df['error_asthma'] = model_df['pred_asthma'] - model_df['asthma']

### Plot error geographically as a chloropleth
(
    plot_chloropleth(
        model_df,
        color='error_asthma',
        cmap='RdBu',
        title='OLS Model Error for Predicting Asthma Rates in Columbus, OH',
        clabel='Under Prediction                                                                                                                     Over Prediction'
    )
    .opts(frame_width=600, aspect='equal')
)
Out[68]:

Figure 9: The map above shows how well the OLS model predicted actual asthma rates for each census tract in Columbus. Red colors indicate an under-prediction, and blues indicate an over-prediction. Results show a consistent over-prediction on the northwest side of the city; when compared to the asthma map created with the CDC data (Figure 2), this area of the city had some of the lowest asthma rates. A converse example is found in the central part of the city, where there is a consistent under-prediction of asthma values. Again, comparing to Figure 2, this central part of the city is where asthma values were high.

Analysis of results:
These results show spatial patterns in where the OLS model over- and under-predicted asthma rates. This pattern reflects a common concept in geography called "spatial autocorrelation": similar phenomena tend to be located closer together than dissimilar phenomena. In this case, the lower and higher levels of asthma are generally clumped together in distinct areas of the city.

One flaw of OLS modeling is that it doesn't account for spatial autocorrelation. One way to proceed with this analysis further might be to analyze the city in different areas based on their asthma levels. For example, from the asthma map created from CDC data (Figure 2), all census tracts above and below a certain asthma threshold could be selected. Then, two new models could be created for the high and low areas of asthma. These two new models could then be used to look at vegetation metrics within these areas, and potentially compare the models and results for high and low asthma areas.
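
One way to test for spatial autocorrelation formally would be Moran's I on the model errors. Below is a minimal sketch, assuming the optional libpysal and esda packages are installed (they are not used elsewhere in this notebook):

In [ ]:
### Sketch: Moran's I on the model errors
### (assumes: pip install libpysal esda)
from libpysal.weights import Queen
from esda.moran import Moran

### queen-contiguity spatial weights from the tract polygons
w = Queen.from_dataframe(model_df)

### a significantly positive Moran's I would confirm that the errors
### cluster in space
mi = Moran(model_df['error_asthma'], w)
print(f"Moran's I: {mi.I:.3f}, p-value: {mi.p_sim:.3f}")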

Conclusion:
Finding a conclusive correlation between two separate variables such as vegetation and asthma can be difficult, as this analysis has shown. In this case, I think some of the strongest evidence for a correlation between asthma and vegetation is found in the spatial comparisons of the asthma and vegetation metrics in Figure 3. There, it's clear that the northwest area of the city has some of the lowest asthma levels, as well as a significantly higher percentage of vegetation than most other parts of the city. Additionally, the vegetation mean patch size is higher in the northwest area. Edge density in this area is at moderate levels, and a correlation between this metric and asthma is still unclear. Overall, the visual results in Figure 3 suggest that a relationship between lower asthma rates and vegetation coverage is likely a real phenomenon in Columbus, OH.

The results of the OLS model show that this type of model is vulnerable to spatial autocorrelation. It may be useful for finding the average asthma value for a given region, but it is problematic for predicting asthma values for a specific tract. Modeling on a finer scale may be necessary to produce better results on a tract-by-tract basis.

Works Cited:

Lee, A. C., H. C. Jordan, and J. Horsley. 2015. "Value of Urban Green Spaces in Promoting Healthy Living and Wellbeing: Prospects for Planning." Risk Management and Healthcare Policy 8: 131–37. https://doi.org/10.2147/RMHP.S61654.

Tsai, Wei-Lun, Yu-Fai Leung, Melissa R. McHale, Myron F. Floyd, and Brian J. Reich. 2019. “Relationships Between Urban Green Land Cover and Human Health at Different Spatial Resolutions.” Urban Ecosystems 22 (2): 315–24. https://doi.org/10.1007/s11252-018-0813-3.

Liu, Dong, Mei-Po Kwan, and Zihan Kan. 2021. "Analysis of Urban Green Space Accessibility and Distribution Inequity in the City of Chicago." Urban Forestry & Urban Greening. https://doi.org/10.1016/j.ufug.2021.127029.

Ohio Department of Health. "Data and Statistics." Asthma Program. https://odh.ohio.gov/know-our-programs/asthma-program/data-and-statistics.